Fix probabilistic variance underflow #527
Draft
Panchadip-128 wants to merge 3 commits into mllam:main from
Conversation
Force-pushed e6e74b2 to bb30e20
Nice catch on the NaN crashes! I’ve definitely been there with PINNs and seen how one zero variance can blow up a whole training run. Clamping at 1e-6 is a solid move for keeping the NLL/CRPS stable. Also, good to see that 0.01 noise replaced with the actual pred_std in the ensemble logic. I’m curious, did you run into these NaN crashes mostly during the initial epochs or during longer auto-regressive rollouts?
Describe your changes
This PR resolves two critical numerical and logical flaws in the probabilistic forecasting (`--output_std`) engine:

- **softplus underflow (NaN crashes):** In `neural_lam/models/base_graph_model.py`, the standard deviation calculation is now clamped to a minimum of `1e-6`. This prevents the network from producing a machine-zero variance, which previously caused division-by-zero errors in the NLL and CRPS metrics, leading to irreversible NaN training losses.
- **Hardcoded ensemble noise:** The `ARModel._sample_ensemble` method in `neural_lam/models/ar_model.py` was previously discarding the model's predicted uncertainty map in favor of a hardcoded `0.01` noise fallback. This has been refactored to utilize the model's dynamically predicted `pred_std` for physically grounded ensemble generation.
- **Tests:** Added `test_base_graph_model_prevents_softplus_underflow_nans` and `test_ar_model_ensemble_samples_from_pred_std` to `tests/test_probabilistic_forecasting.py` to assert that these mathematical stability and logic requirements are met.

Motivation and Context: These bugs made probabilistic training unstable and produced deceptively uniform ensemble spreads that ignored the model's internal confidence.
Dependencies: No new dependencies.
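For the ensemble change, a rough sketch of the after-fix behaviour (again in NumPy for brevity; the real method is `ARModel._sample_ensemble` in `neural_lam/models/ar_model.py`, and `sample_ensemble` below is a simplified, hypothetical stand-in):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_ensemble(pred_mean, pred_std, num_members):
    # Before the fix: noise used a hardcoded 0.01 scale everywhere, giving
    # every grid cell the same spread regardless of model confidence.
    # After the fix: members are perturbed with the predicted pred_std.
    noise = rng.standard_normal((num_members,) + pred_mean.shape)
    return pred_mean[None, ...] + pred_std[None, ...] * noise

pred_mean = np.zeros((4, 3))                        # e.g. (grid_nodes, vars)
pred_std = np.array([0.1, 1.0, 2.0]) * np.ones((4, 1))
members = sample_ensemble(pred_mean, pred_std, num_members=1000)

# Per-variable ensemble spread now tracks pred_std instead of a flat 0.01.
print(members.std(axis=0).mean(axis=0))
```

With a hardcoded 0.01 scale every member is nearly identical to the mean; sampling with `pred_std` lets the spread follow the model's per-variable, per-location confidence.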
Issue Link
solves #526
Type of change
Checklist before requesting a review
`upstream/main` (rebased and force-pushed).

Author checklist after completed review
Added a line to `CHANGELOG.md` describing this change.